Skip to content

Drain LibMR threads to a safe point before fork() (MOD-15307)#97

Open
gabsow wants to merge 7 commits into
masterfrom
mod-15307-prefork-drain
Open

Drain LibMR threads to a safe point before fork() (MOD-15307)#97
gabsow wants to merge 7 commits into
masterfrom
mod-15307-prefork-drain

Conversation

@gabsow

@gabsow gabsow commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Summary

Adds MR_DrainForFork() / MR_ResumeAfterFork(). Called from the embedding module's pre-fork handler (main thread), MR_DrainForFork():

  • parks the event-loop thread at a between-tasks safe point via a posted task, and
  • bounded-waits the worker pool to idle,

so no LibMR thread holds a libc lock at fork() (which would ghost-lock the child). MR_ResumeAfterFork() releases the parked event-loop thread (called after the fork, on success or cancel).

Cooperative, bounded, fail-open. Deliberately not the existing mr_thpool_pause (SIGUSR2), which can freeze a worker mid-malloc holding the arena lock — exactly the ghost-lock this prevents. A worker still blocked acquiring the GIL is already malloc-safe, so a drain timeout there is benign.

Why / depends on

Fixes the RedisTimeSeries ASM-migration nightly hangs (MOD-14615 valgrind, MOD-14239 sanitizer). The embedding module wires this to redis core's new FORK_CHILD_PRE subevent (redis/redis#15327).

Pre-merge note

The two RedisModule_Log(..., "notice", ...) lines in MR_DrainForFork/MR_ResumeAfterFork should be downgraded to debug (kept at notice to confirm the drain fires during CI validation).

🤖 Generated with Claude Code


Note

Medium Risk
Fork-time threading and allocator safety are sensitive; incomplete drain can still leave ghost locks, though behavior is bounded and fail-open with warnings.

Overview
Adds MR_DrainForFork() and MR_ResumeAfterFork() so embedding modules can quiesce LibMR immediately before fork(). MR_DrainForFork() posts a task that parks the event-loop thread at a between-tasks safe point, then bounded-waits (2s) for that park and for the execution worker pool to go idle via new mr_thpool_wait_timeout(). If quiescence is incomplete, it logs a warning and continues (fail-open). MR_ResumeAfterFork() unblocks the parked event-loop thread and must be paired with every drain (after fork success or cancel).

This is intentionally not mr_thpool_pause (SIGUSR2), which can stop workers mid-malloc and worsen ghost-lock risk.

Reviewed by Cursor Bugbot for commit df7cc09. Bugbot is set up for automated code reviews on this repo. Configure here.

Add MR_DrainForFork()/MR_ResumeAfterFork(). On the main thread (from the module's
FORK_CHILD_PRE handler) park the event-loop thread at a between-tasks safe point via
a posted task and bounded-wait the worker pool to idle, so no LibMR thread holds a
libc lock at fork() (ghost-lock). Bounded + fail-open; cooperative (not the SIGUSR2
mr_thpool_pause, which can freeze a worker mid-malloc). Resume releases the parked
event-loop thread.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
gabsow and others added 2 commits June 11, 2026 13:20
Logs the time spent quiescing the threads (excluding the fork that follows),
plus whether the event-loop thread parked and the busy-worker count, so the
pre-fork drain cost can be measured. Debug level so it is silent in production.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
When a multi-shard execution hits max-idle (EXECUTION_DEFAULT_MAX_IDLE_MS), the
coordinator only logged a fixed "execution max idle reached" string with no clue
which peer stalled. Now, at timeout, log the non-responding peer shard id(s) and
endpoint(s), replies received vs expected, the elapsed wait, and time since last
progress — at warning level so it is captured at the default loglevel (it
coincides with the user-visible failure and is rate-bounded by nMaxIdleReached).

Track responders by recording each ACK / NOTIFY_DONE sender node-id into a
heap-strings set on the Execution (freed in MR_FreeExecution, the single final
owner); pending = cluster peers minus that set, formatted by a new
MR_ClusterFormatPendingPeers helper. Dispatch time is stamped and logged at debug.
All new state is touched only on the event-loop thread, so no extra locking.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@gabsow gabsow marked this pull request as ready for review June 17, 2026 18:19
Comment thread src/mr.c Outdated
gabsow and others added 2 commits June 17, 2026 21:22
MR_DrainForFork only waited for mr_thpool_num_threads_working() to reach 0,
ignoring jobs that are queued but not yet picked up by a worker. Such a job
could be dequeued and start running (and allocating) immediately after the
check, right as fork() runs -- defeating the drain. Add
mr_thpool_num_jobs_in_queue() and wait for BOTH the working count and the queue
to reach 0 (the same invariant thpool_wait uses). Addresses the bugbot
"fork drain ignores queued jobs" finding.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit f6fe687. Configure here.

Comment thread src/mr.c Outdated
Comment thread src/utils/thpool.c Outdated
mr_thpool_num_jobs_in_queue() read jobqueue.len without holding
jobqueue.rwmutex, while jobqueue_push/pull mutate it under that lock -- a data
race that could make MR_DrainForFork see a stale (zero) queue and fork early.
Take rwmutex for the read. Addresses the bugbot "unlocked job queue length
read" finding.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@gabsow

gabsow commented Jun 18, 2026

Copy link
Copy Markdown
Contributor Author

@galcohen-redislabs wanted your opinion on the last commit (995f842) before we lean on it. It takes jobqueue.rwmutex when reading jobqueue.len in mr_thpool_num_jobs_in_queue, to address the Bugbot "unlocked job-queue length read" finding.

Two things I am not sure about:

  1. Is it really necessary? The neighbouring mr_thpool_num_threads_working() reads num_threads_working unlocked as well, and MR_DrainForFork polls both in a 1ms loop that is best-effort (it forks anyway on timeout) — so a stale len self-corrects on the next tick. By that logic the unlocked read is arguably benign (same as the existing one), and we could just reply to the bot instead of adding a lock.

  2. Isn't it dangerous? It adds a lock acquisition on the main thread, in the pre-fork drain path. As far as I can tell it's safe — the lock is taken and released inside the accessor (never held across fork()), and workers only hold rwmutex briefly in jobqueue_push/jobqueue_pull without needing the GIL, so there's no GIL/lock inversion or fork-time deadlock. But I would rather get your read before relying on it.

Happy to drop the lock and instead annotate the read as a benign race (matching num_threads_working) if you think that is cleaner/safer. What do you think?

Comment thread src/cluster.c Outdated
* `responded` set into "id(ip:port),..." in `out`, so a max-idle timeout can
* name the non-responding shard(s). Runs on the event-loop thread (same as the
* message handlers), so reading the cluster node table needs no extra locking.
* The node-id is the same NUL-terminated string used as the responded-set key. */

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole comment is too specific: no need to specify the ticket number and no need to explain the implementation and cause for this specific issue.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new diagnostic data of more explicit names (of workers that didn't respond after a timeout, etc.), including the needed locking mechanism and the monotonic clock, etc. are not really related to the core issue in this ticket (i.e., the forking coordination between the module and the redis core).

The whole thing is just a code slop and should not be included in this PR. If we really want such code (only if it is really a big issue, which honestly I don't think it is...) then we should open a specific ticket for that.

Comment thread src/cluster.c Outdated
Node* n = mr_dictGetVal(entry);
if (n->isMe) continue;
if (responded && mr_dictFind(responded, n->id)) continue;
int w = snprintf(out + off, outLen - off, "%s%s(%s:%u)",

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The size_t math is dangerous. As a rule of thumb you don't want to subtract two unsigned (for the fear of a wrap-around to large numbers).
Better to add an explicit guard here and change the comparison below to use additions only.

Comment thread src/cluster.h Outdated

size_t MR_ClusterGetSize();

/* MOD-14615: format peer shards (id(ip:port)) that are NOT in `responded` into

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here too: no need to include ticket number or implementation details or reasoning for doing so or apologies about the struct declaration...

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(also in all other places; we don't need comments explaining how and why code was written to solve a specific ticket)

Comment thread src/mr.c
REDISMODULE_NOT_USED(ctx);
pthread_mutex_lock(&mr_forkDrainLock);
mr_forkElParked = 1;
pthread_cond_broadcast(&mr_forkDrainCond);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use the already-existing sync. functions in utils?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also: the broadcasting back and forth between ml and worker threads (with its extra locks and hidden locks such that in mr_thpool_num_jobs_in_queue()) pause a threat of increasing latencies for all commands. I would really like a much simpler solution, if possible.

- Wait for the worker pool to drain via the pool's own bounded idle wait
  (new mr_thpool_wait_timeout, reusing thcount_lock/threads_all_idle) instead of
  a hand-rolled poll loop plus a per-read jobqueue lock. Removes the hot-path
  lock contention that risked added command latency.
- Remove the max-idle "which shard didn't reply" diagnostics entirely
  (responded-sets, monotonic timestamps, dispatch/timeout logging,
  MR_ClusterFormatPendingPeers, mr_thpool_num_jobs_in_queue) -- out of scope for
  the fork-coordination fix; will be tracked in a separate ticket if needed.
- Trim comments (no ticket numbers / implementation rationale).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@gabsow

gabsow commented Jun 21, 2026

Copy link
Copy Markdown
Contributor Author

@galcohen-redislabs thanks for the thorough review — addressed all of it in df7cc09 (net −177/+42):

  • Removed the max-idle diagnostics entirely — the responded-sets, monotonic timestamps, dispatch/timeout logging, MR_ClusterFormatPendingPeers, and mr_thpool_num_jobs_in_queue. Agreed it's out of scope for the fork-coordination fix; tracked separately as MOD-16404 (your latency + size_t notes are captured there, and the original impl is referenced for whoever picks it up). Removing it also takes the size_t subtraction you flagged with it.
  • Simplified the drain to reuse existing sync — the worker wait is now a new mr_thpool_wait_timeout() that simply bounds the existing mr_thpool_wait on the pool's own thcount_lock/threads_all_idle. No hand-rolled poll loop, no per-read jobqueue lock, and no el↔worker broadcasting on the command path — so no added per-command latency. The whole drain is now: park the event-loop thread, then mr_thpool_wait_timeout(pool, 2s).
  • Trimmed the comments — dropped ticket numbers and the how/why-for-this-ticket explanations throughout.

PTAL.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants